Contents

1 Introduction

“Celda” stands for “CEllular Latent Dirichlet Allocation”, which is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. Celda has advantages over other clustering frameworks:

  1. Celda can simultaneously cluster cells into cell subpopulations and genes into transcriptional states - subgroups of genes that are likely to be expressed in similar proportions - over different types of cells
  2. Celda uses count-based Dirichlet-multinomial distributions so no additional normalization is required for 3’ DGE single cell RNA-seq
  3. These types of models have shown good performance with sparse data.

In this vignette we will demonstrate how to perform cell and gene clustering with simulated and real single cell RNA-seq data using the Bayesian hierarchical models within Celda.

1.1 Installation Instructions

Currently Celda is hosted on a Github repository compbiomed/celda.

library(devtools)
source("https://bioconductor.org/biocLite.R")
install.packages("Rcpp")
biocLite("SummarizedExperiment")
biocLite("MAST")
install_github("compbiomed/celda")

This should automatically install Celda on your local machine as long as you have proper internet connection. If you encounter other problems during the installation you may want to open up a browser and access the issues page on our github repository at https://github.com/compbiomed/celda. If you have any other questions or concerns, you could also contact our Google Group at: https://groups.google.com/forum/#!forum/celda-list.

Once installed the package can be loaded using the library command.

library(celda)

Complete list of help files are accessible using the help command with a package option.

help(package=celda)


2 Single cell RNA-seq dataset analysis with Celda

We will learn how to use the package Celda to cluster genes and cells from count data generated in single cell RNA-seq experiments. Celda will take a matrix of counts where each row is a gene and each cell is a column and each entry in the matrix is the number of counts for each gene in each cell. For this tutorial, we will start with a simple matrix of counts that were simulated according to the generative process behind the celda_CG model.

We use the built-in data generating function simulateCells to simulate a single-cell RNA-Seq dataset. simulateCells returns a list containing a count matrix where each row is a gene and each column is a cell. The K parameter determines the number of cell clusters while the L parameter determines the number of gene clusters within the dataset. The S parameter determines the number of samples.

sim_counts <- simulateCells("celda_CG", K = 6, L = 6, S = 10)

The counts variable contains the counts matrix. The dimensions of counts matrix:

dim(sim_counts$counts)
## [1] 1000  822

The z variable contains the cluster labels for each cell. Here is the number of cells in each subpopulation:

table(sim_counts$z)
## 
##   1   2   3   4   5   6 
## 125 103 141  95 147 211

The y variable contains the transcriptional state assignment for each gene. Here is the number of genes in each transcriptional state:

table(sim_counts$y)
## 
##   1   2   3   4   5   6 
##  50  27 201 374 107 241

The sample.label is used to denote the sample from which each cell was derived. In this simulated dataset, we have 10 samples:

table(sim_counts$sample.label)
## 
##  Sample_1 Sample_10  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6 
##        86       100        94        88        95        73        58 
##  Sample_7  Sample_8  Sample_9 
##        66        75        87

Each cell is assumed to come from a sample. Here is the number of cells in each subpopulation within each sample:

table(sim_counts$z, sim_counts$sample.label)
##    
##     Sample_1 Sample_10 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
##   1       25         1        2       14       11       11       12       19
##   2        5         1        9        6       19        5        7       20
##   3        3        54        2       21       10        9       14       11
##   4        3         1       32       15       11        5        2        4
##   5       25         0        2       30       31       40        5        8
##   6       25        43       47        2       13        3       18        4
##    
##     Sample_8 Sample_9
##   1       13       17
##   2       19       12
##   3       15        2
##   4       14        8
##   5        6        0
##   6        8       48

2.1 Running the Celda function

There are currently three models within this package: celda_C will cluster cells, celda_G will cluster genes, and celda_CG will simultaneously cluster cells and genes. The celda function will use these models and take in a counts matrix to create a celda.list object that contains the cluster labels for each cell and gene.

celda.res <- celda(counts = sim_counts$counts, model = "celda_CG", K = 6, L = 6, max.iter = 10, cores = 1, nchains = 1)

The celda.list contains 4 objects: “run.params”, “res.list”,“content.type”,and “count.checksum”. The Celda model(s) will be stored within the “res.list”.

Here is a comparison between the true cluster labels and the estimated cluster labels.

names(celda.res)
## [1] "run.params"     "res.list"       "content.type"   "count.checksum"
model <- celda.res$res.list[[1]]
z <- model$z
y <- model$y

table(z, sim_counts$z)
##    
## z     1   2   3   4   5   6
##   1 125   0   0   0   0   0
##   2   0 103   0   0   0   0
##   3   0   0 141   0   0   0
##   4   0   0   0  95   0   0
##   5   0   0   0   0   0 211
##   6   0   0   0   0 147   0
table(y, sim_counts$y)
##    
## y     1   2   3   4   5   6
##   1  49   0   0   0   0   0
##   2   0  27   0   0   0   0
##   3   0   0 200   1   0   0
##   4   1   0   1 373   0   1
##   5   0   0   0   0 107   0
##   6   0   0   0   0   0 240

We can display the clustering results with a heatmap of the normalized counts, obtained by the function normalizeCounts, which normalizes a counts matrix by a scalar factor.

norm.counts <- normalizeCounts(sim_counts$counts, scale.factor = 1e6)
renderCeldaHeatmap(counts = norm.counts, z=z, y=y, normalize = NULL, 
                   color_scheme = "divergent",cluster_gene = TRUE, 
                   cluster_cell = TRUE)

2.2 Matrix factorization

Celda can also perform matrix factorization to summarize the contribution of each transcriptional state to each cellular subpopulation. Matrix factorization decomposes the counts matrix into multiple matrices, using the given Celda model. Each of these following matrices can be viewed as raw counts values, proportional values, or posterior probability values.

factorized <- factorizeMatrix(model, sim_counts$count)
names(factorized)
## [1] "counts"      "proportions" "posterior"

The gene.states contains each gene’s contribution to the transcriptional state it belongs. For our example, the matrix contains 1000 genes to 6 transcriptional state.

dim(factorized$proportions$gene.states)
## [1] 1000    6
head(factorized$proportions$gene.states)
##        L1 L2         L3          L4          L5 L6
## Gene_1  0  0 0.00000000 0.000000000 0.002372693  0
## Gene_2  0  0 0.00000000 0.000000000 0.008278966  0
## Gene_3  0  0 0.02276603 0.000000000 0.000000000  0
## Gene_4  0  0 0.00000000 0.002711210 0.000000000  0
## Gene_5  0  0 0.00000000 0.004446981 0.000000000  0
## Gene_6  0  0 0.00000000 0.002785753 0.000000000  0

The cell.states contains each transcriptional states’s contribution to each cell subpopulation. Here, there are 6 transcriptional states to 822 cells.

dim(factorized$proportions$cell.states)
## [1]   6 822
factorized$proportions$cell.states[,1:8]
##         Cell_1     Cell_2      Cell_3      Cell_4      Cell_5      Cell_6
## L1 0.193893130 0.08523207 0.193387562 0.188118812 0.139823009 0.168926773
## L2 0.206106870 0.04050633 0.193125164 0.201980198 0.009734513 0.192173576
## L3 0.387786260 0.01687764 0.399632642 0.378217822 0.090265487 0.420767145
## L4 0.053435115 0.02784810 0.057727631 0.057425743 0.293805310 0.040294460
## L5 0.004580153 0.26160338 0.004198373 0.003960396 0.085840708 0.001549787
## L6 0.154198473 0.56793249 0.151928628 0.170297030 0.380530973 0.176288260
##       Cell_7      Cell_8
## L1 0.0364931 0.134573778
## L2 0.1695594 0.009519688
## L3 0.1379617 0.079186499
## L4 0.3311081 0.284725227
## L5 0.1014686 0.101254868
## L6 0.2234090 0.390739939

The population.states contains each transcriptional states’ contribution to each of the cell states. Since K and L were set to be 6, there are 6 cell supopulations to 6 transcriptional states.

pop.states <- factorized$proportions$population.states
dim(pop.states)
## [1] 6 6
pop.states
##             K1         K2         K3          K4         K5         K6
## L1 0.191743984 0.15050892 0.08830247 0.085650270 0.13057352 0.03568253
## L2 0.194996281 0.19499099 0.04624651 0.006115229 0.01112842 0.16329832
## L3 0.405224153 0.43362699 0.01274006 0.161694744 0.08475810 0.13971169
## L4 0.047080010 0.16058868 0.03708449 0.269110512 0.29805329 0.32843685
## L5 0.003092693 0.03386761 0.24896449 0.318763477 0.08426042 0.10126270
## L6 0.157862880 0.02641680 0.56666199 0.158665768 0.39122626 0.23160791

2.3 Visualization of the Celda model

Celda contains several plotting tools for data dimensionality reduction tool outputs. The PCA result for the factorized matrix is plotted by the plotDrCluster and plotDrState function based off of the Celda clusters and state probabilities for each cell respectively:

data.pca <- prcomp(t(scale(t(factorized$proportions$cell.states))),
                   scale = F, center = F)

plotDrCluster(dim1 = data.pca$rotation[,1], dim2 = data.pca$rotation[,2], 
              cluster = celda.res$res.list[[1]]$z)

plotDrState(dim1 = data.pca$rotation[,1], dim2 = data.pca$rotation[,2],
            matrix = factorized$proportions$cell.states, rescale = TRUE)

In addition to the counts matrix, the transcriptional and cell states of a Celda model can also be visualized with a heatmap, which may be useful in determining which cell states highly express a particular transcriptional state. Z-scoring the matrix with the scale parameter can be used for visualization purposes.

renderProbabilityHeatmap(model = model, counts = sim_counts$counts, 
                         relative = FALSE, scale = TRUE)

Sometimes it may be helpful to visualize the relative probability of each transcriptional state in each cellular subpopulation. We can do this by using the relative parameter:

renderProbabilityHeatmap(model = model, counts = sim_counts$counts, 
                         relative = TRUE, scale = TRUE)

Without z-scoring:

renderProbabilityHeatmap(model = model, counts = sim_counts$counts, 
                         relative = TRUE, scale = FALSE)


3 Exploration of the PBMC dataset with Celda

We will now apply Celda on the “pbmc_data” dataset, which is a single-cell RNA-seq dataset of 3000 Peripheral Blood Mononuclear Cells (PBMC), available from 10X Genomics. The rows are organized by gene names, while the columns are organized by barcodes. For this tutorial, the dataset has been slightly modified so the rownames are comprised of the Ensembl gene ID as well as the gene name. The raw dataset can be found at https://support.10xgenomics.com/single-cell/software/pipelines/latest/rkit.

To reduce computational time, we will only include genes with at least 4 counts in a least 4 cells for this analysis, which we have done already for you. However, Celda is capable of analyzing genes with fewer counts.

##Original pbmc_data: 7090 genes, 2700 cells
#pbmc_select <- pbmc_data[rowSums(pbmc_data>3) > 3,]

dim(pbmc_select)
## [1] 1321 2700
head(rownames(pbmc_select))
## [1] "ENSG00000000938_FGR"     "ENSG00000002549_LAP3"   
## [3] "ENSG00000002586_CD99"    "ENSG00000003056_M6PR"   
## [5] "ENSG00000004059_ARF5"    "ENSG00000004779_NDUFAB1"
head(colnames(pbmc_select))
## [1] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1"
## [4] "AAACCGTGCTTCCG-1" "AAACCGTGTATGCG-1" "AAACGCACTGGTAC-1"

3.1 Identifying the optimal number of cell subpopulations and transcriptional states

The optimal number of K and L will likely not be known a priori. Therefore multiple choices of K and L may need to tested and compared. Additionally, Celda is sensitive to initial start conditions, so it is good practice to run multiple chains for each combination of K and L to increase the chances of finding the most optimal solution. Celda is able to run multiple combinations of K and L with multiple chains for each combination in parallel.

pbmc_res <- celda(pbmc_select, K = seq(5:50,by = 5), L = seq(10:50,by = 5), 
                  cores = 1, model = "celda_CG", nchains = 4, max.iter = 100)

The Celda package contains several methods to determine the optimal number of K upon creating the models. calculatePerplexityWithResampling calculates the perplexity for every model created, which provide a distribution of perplexities which can be visualized to determine which model to use for downstream analysis.

calc.perplexity <- calculatePerplexityWithResampling(pbmc_res, 
                                                     counts = pbmc_select, 
                                                     resample = TRUE)
visualizePerplexityByKL(calc.perplexity$perplexity.info)

The function gettingClusters attempts to identify the optimal number for K by looking for differential expression between each cell clusters in a single model. The data can be trimmed (i.e, all values higher than 50 are trimmed to 50) for easier visualization.

cluster.plot <- gettingClusters(celda.list = pbmc_res, matrix = pbmc_select, iterations = 3)
cluster.plot$data$negative_log[cluster.plot$data$negative_log > 50] <- 50
cluster.plot

In this example, it appears that K = 15 at L = 50 gives the best model.

We can use the function getBestModel to select the best model with the highest log-likelihood from the list of models once we have selected the optimal K and L. The replicate with the highest log likelihood will be returned from all models that were run with the selected K and L.

model.pbmc <- getBestModel(pbmc_res, K = 15, L = 50)

3.2 Analysis of PBMC Celda model

renderProbabilityHeatmap, as discussed above visualizes the relationship between each transcriptional state and cell subpopulation in a heatmap form.

renderProbabilityHeatmap(counts = pbmc_select, model = model.pbmc, 
                         relative = TRUE, scale = TRUE)

Celda supports the use of tSNE (t-stochastic neighbor embedding) via the celdaTsne function. We can plot these tSNE results via plotDrCluster, plotDrState and plotDrGene. These will determine how each cell is labeled via celda, visualize each cell’s expression respectively of a specific transcriptional state and visualize each cell’s expression of a set of genes.

factorize.matrix <- factorizeMatrix(model.pbmc, counts=pbmc_select)
norm_pbmc <- normalizeCounts(factorize.matrix$counts$cell.states)
set.seed(123)
pbmc_tsne <- celdaTsne(counts = pbmc_select, celda.mod = model.pbmc, distance = "cosine")

With the use of the plots, it is also possible to determine which cell clusters express specific marker genes.

plotDrCluster(dim1 = pbmc_tsne[,1], dim2 = pbmc_tsne[,2], cluster = as.factor(model.pbmc$z))

plotDrState(dim1 = pbmc_tsne[,1], dim2 = pbmc_tsne[,2], matrix = factorize.matrix$proportions$cell.states, rescale = TRUE)

marker.genes <- c("ENSG00000168685_IL7R","ENSG00000198851_CD3E","ENSG00000105374_NKG7",
"ENSG00000203747_FCGR3A","ENSG00000090382_LYZ","ENSG00000179639_FCER1A",
"ENSG00000156738_MS4A1", "ENSG00000163736_PPBP","ENSG00000101439_CST3")
gene.counts <- pbmc_select[marker.genes,]
plotDrGene(dim1 = pbmc_tsne[,1],dim2 = pbmc_tsne[,2], matrix = gene.counts, 
           rescale = TRUE)

The GiniPlot function utilizes the Gini coefficient, a statistical dispersion measure, to determine the significance of each transcriptional states. The plot will give users a rough estimate of which transcriptional states to look for first.

gini <- GiniPlot(counts = pbmc_select, celda.mod = model.pbmc)

gini

Celda employs the MAST package (McDavid A, 2018) for differential expression analysis of single-cell data. The diffExpBetweenCellStates function determines genes that are differentially expressed in one cell subpopulation against all other subpopulations, or between two cell subpopulations.

If you wish to compare one cell subpopulation compared to all others, leave c2 as NULL:

diff.exp.clust1 <- diffExpBetweenCellStates(counts = pbmc_select, 
                                            celda.mod = model.pbmc, 
                                            c1 = 1, c2 = NULL)

head(diff.exp.clust1,10)
##                          Gene   Pr(>Chisq)   log2fc          fdr
##  1:      ENSG00000163736_PPBP 5.959488e-41 13.71396 1.574497e-38
##  2:       ENSG00000163737_PF4 4.966373e-32 12.72657 3.644766e-30
##  3:     ENSG00000127920_GNG11 7.455046e-39 12.43603 1.633866e-36
##  4:      ENSG00000168497_SDPR 4.863507e-38 12.42767 7.138548e-36
##  5: ENSG00000180573_HIST1H2AC 9.687060e-39 11.41711 1.633866e-36
##  6:     ENSG00000101162_TUBB1 3.279999e-34 11.16967 2.888585e-32
##  7:       ENSG00000120885_CLU 4.297924e-34 11.11347 3.548473e-32
##  8:       ENSG00000010278_CD9 5.841343e-33 10.98146 4.539067e-31
##  9:     ENSG00000171611_PTCRA 1.561938e-25 10.88895 6.068588e-24
## 10:      ENSG00000154146_NRGN 1.412774e-30 10.57494 9.331373e-29

If you wish to compare two cell subpopulations, use both c1 and c2 parameters.

diff.exp.clust1vs2 <- diffExpBetweenCellStates(counts = pbmc_select, celda.mod = model.pbmc, c1 = 6, c2 = 7)

diff.exp.clust1vs2 <- diff.exp.clust1vs2[diff.exp.clust1vs2$fdr < 0.25,]

The top upregulated genes in the first cluster:

head(diff.exp.clust1vs2[order(diff.exp.clust1vs2$log2fc, decreasing = TRUE),],10)
##                       Gene   Pr(>Chisq)   log2fc          fdr
##  1: ENSG00000158869_FCER1G 1.038059e-26 6.283397 1.958965e-24
##  2:   ENSG00000100453_GZMB 2.992048e-44 6.128375 3.952495e-41
##  3: ENSG00000011600_TYROBP 1.094117e-24 6.095883 1.192391e-22
##  4:   ENSG00000115523_GNLY 1.091942e-19 6.029591 7.591871e-18
##  5:  ENSG00000159674_SPON2 5.322953e-25 5.754154 7.031621e-23
##  6: ENSG00000203747_FCGR3A 3.526859e-24 5.393152 3.327844e-22
##  7: ENSG00000137441_FGFBP2 8.578263e-25 5.066204 1.030171e-22
##  8: ENSG00000163453_IGFBP7 6.102526e-19 4.630742 3.358932e-17
##  9:  ENSG00000198821_CD247 1.173435e-24 4.535172 1.192391e-22
## 10:    ENSG00000173762_CD7 1.572047e-16 4.478044 7.160945e-15

The top downregulated genes in the first cluster:

head(diff.exp.clust1vs2[order(diff.exp.clust1vs2$log2fc),],10)
##                         Gene   Pr(>Chisq)    log2fc          fdr
##  1:     ENSG00000167286_CD3D 7.855456e-32 -7.154195 3.459019e-29
##  2:     ENSG00000153563_CD8A 1.180135e-14 -4.132968 4.585172e-13
##  3:     ENSG00000113088_GZMK 1.594510e-14 -3.990027 5.850966e-13
##  4:     ENSG00000198851_CD3E 4.258257e-11 -3.965626 9.221570e-10
##  5:   ENSG00000241343_RPL36A 8.878622e-16 -3.892339 3.554139e-14
##  6:     ENSG00000169442_CD52 3.187762e-12 -3.605618 8.145610e-11
##  7:     ENSG00000172116_CD8B 6.502578e-14 -3.571881 2.045216e-12
##  8: ENSG00000196126_HLA-DRB1 1.953464e-10 -3.535898 3.851530e-09
##  9:      ENSG00000026025_VIM 4.738521e-07 -3.237857 6.018833e-06
## 10:     ENSG00000008517_IL32 1.127489e-08 -3.171722 1.838781e-07

Upon matrix factorization of the counts data, the top genes in a transcriptional state can be selected using the topRank function on the “gene.states” matrix:

factorize.matrix <- factorizeMatrix(model.pbmc, counts=pbmc_select)
top.genes <- topRank(factorize.matrix$proportions$gene.states, n = 25)
top.genes$names$L37
##  [1] "ENSG00000163736_PPBP"      "ENSG00000163737_PF4"      
##  [3] "ENSG00000180573_HIST1H2AC" "ENSG00000168497_SDPR"     
##  [5] "ENSG00000127920_GNG11"     "ENSG00000102804_TSC22D1"  
##  [7] "ENSG00000101162_TUBB1"     "ENSG00000010278_CD9"      
##  [9] "ENSG00000154146_NRGN"      "ENSG00000120885_CLU"      
## [11] "ENSG00000104267_CA2"       "ENSG00000171611_PTCRA"    
## [13] "ENSG00000113140_SPARC"     "ENSG00000169704_GP9"      
## [15] "ENSG00000101335_MYL9"      "ENSG00000005961_ITGA2B"   
## [17] "ENSG00000184113_CLDN5"
top.genes$names$L33
## [1] "ENSG00000163220_S100A9"  "ENSG00000143546_S100A8" 
## [3] "ENSG00000163221_S100A12" "ENSG00000162444_RBP7"   
## [5] "ENSG00000110203_FOLR3"

We can make a heatmap of these top genes as follows:

top.genes.ix <- unique(unlist(top.genes$index))
norm.pbmc.counts <- normalizeCounts(pbmc_select)
renderCeldaHeatmap(norm.pbmc.counts[top.genes.ix,], z = model.pbmc$z, 
                   y = model.pbmc$y[top.genes.ix], normalize = NULL, 
                   color_scheme = "divergent")

While some transcriptional states contribute to the identification of cell subpopulations, some states have low statistical value to the mode, and users may find it useful to re-create the heatmap using only states that have a high enough Gini coefficient value to create a “cleaner” heatmap. For this example, we will use states that have a Gini coefficient higher than 0.3. This is an arbitrary cutoff, and users can set their own threshold for the cutoff.

filtered.tr.states <- gini$data$Transcriptional_States[gini$data$Gini_Coefficient > 0.3]

top.genes.filtered.ix <- unique(unlist(top.genes$index[as.numeric(levels(filtered.tr.states))][as.numeric(filtered.tr.states)]))

renderCeldaHeatmap(norm.pbmc.counts[top.genes.filtered.ix,], z = model.pbmc$z, 
                   y = model.pbmc$y[top.genes.filtered.ix], 
                   normalize = NULL, color_scheme = "divergent")

If there is a particular gene of interest, it is possible to check its transcriptional state with the lookupTranscriptionalStateofGene function.

lookupTranscriptionalStateofGene(counts = pbmc_select, model = model.pbmc, 
                                 gene = c("ENSG00000203747_FCGR3A",
                                          "ENSG00000163736_PPBP"))
## $ENSG00000203747_FCGR3A
## [1] 39
## 
## $ENSG00000163736_PPBP
## [1] 37

stateHeatmap creates a heatmap using only the genes from a specific transcriptional state, which enables the visualization of co-expression patterns of genes within the transcriptional state.

stateHeatmap(counts = pbmc_select, celda.mod = model.pbmc, state.use = 17)